A large repertoire of gene-centric data has been generated in the field of zebrafish biology. Although the bulk of these
data are available in the public domain, most of them are not readily accessible or are available only in nonstandard formats. One
major challenge is to unify and integrate these widely scattered data sources. We tested the hypothesis that active community
participation could be a viable option to address this challenge. We present here our approach to create standards
for assimilation and sharing of information and a system of open standards for database intercommunication. We have 
attempted to address this challenge by creating a community-centric solution for zebrafish gene annotation. The Zebrafish 
GenomeWiki is a 'wiki'-based resource, which aims to provide an altruistic shared environment for collective annotation of 
the zebrafish genes. The Zebrafish GenomeWiki has features that enable users to comment, annotate, edit and rate this 
gene-centric information. The credits for contributions can be tracked through a transparent microattribution system. In 
contrast to other wikis, the Zebrafish GenomeWiki is a 'structured wiki' or rather a 'semantic wiki'. The Zebrafish 
GenomeWiki implements a semantically linked data structure, which in the future would be amenable to semantic search. 

Database URL: http://genome.igib.res.in/twiki 
Introduction

Recent advances in genomics have required biologists to 
revisit accepted paradigms of how genes function and 
interact, as well as how genes cooperate to modulate a 
diverse array of complex biological processes such as devel- 
opment, metabolism and behavior (1). Application of 
advanced genomic approaches in humans and model 
organisms, including worms, flies, fish and rodents, has 
generated vast amounts of data, which have been reported
in publications and public databases (2-4). Such publica- 
tions typically describe salient features and general pat- 
terns of the genome-wide data, while the frequently 
large and multidimensional experimental data sets are pre- 
sented as supplementary information or databases. 
Furthermore, the data generation process spans multiple 
laboratories involving diverse techniques, adding to the 
complexity of data presentation. All these factors create 
barriers to data integration including incompatible file for- 
mats and improper semantics. Consequently, this collective 
wealth of genomic and functional information remains iso- 
lated with little space for downstream integrative analysis 
to advance biological understanding. 

Integration of these data sets into a common platform 
has been inhibited by the need for systematic manual cur- 
ation (5-8) of information from unstructured data sources 
(published articles and supplementary literature) and from 
structured entities (databases and other structured data 
sets). The massive volumes of dynamic bioinformatics data 
pose serious challenges to biocurators. On one hand, the 
sheer volumes of data make it impossible for a single indi- 
vidual to connect the dots; on the other hand, the dynamic 
(and sometimes volatile) information makes it nearly im- 
possible to create spatial and temporal snapshots of gene 
products and their functions across an entire genome. 
Despite these significant challenges, the literature abounds 
with models and examples of successful integration of re- 
sources and manual curation via community participation 
(9, 10). Perhaps the best-known example of a successful 
community-based curation model is the 'wiki' solution pro- 
posed by Ward Cunningham, which was systematically 
taken up by the general internet user community to 
create the large common knowledge repository Wikipedia 
(11). 

Such community participation in an open environment 
has been successfully applied for covering the long tail in 
biological annotations (12-15). These methods have also 
been tested for genomic applications. The term 'long tail' 
was popularized by Chris Anderson (16) to describe the retailing
strategy of selling small quantities of a large number of unique
goods rather than large quantities of a few popular goods. The term
has been used in science to describe how people tend to access
articles in popular journals and miss important data found in
less popular journals or buried deep in the supplementary material
of the manuscripts. So, in science, the 'long tail' is about collecting
and connecting these missing links. In recent years, wiki- 
based annotation platforms, such as WikiProteins (17), 
WikiPathways (18) and WikiGenes (19), have enjoyed 
broad community participation. WikiProteins is a semantic 
web-based (20-22) portal modeled on wiki pages with con- 
nected knowlets of >1 million biomedical concepts. There 
has been a general trend toward using wikis as collabora- 
tive tools due to their simplicity and ease of use (8, 9). One 
major limitation of wikis as biocuration platforms is that 
the relevant data are inherently unstructured, organized 
in sentences and paragraphs, hindering text-mining and 
integrative analysis by machine logic. Structured wikis (20, 
22) have been proposed to address this problem and allow 
easy processing and sharing of the data by users.
Structured wikis allow creation of relations between differ- 
ent data sets or data points using standard ontologies 
enabling machine-readable links and easing the integration 
and analysis of large data sets. 

Zebrafish (Danio rerio), a popular vertebrate model organism,
has a genome of ~1.5 billion base pairs distributed
over 25 chromosomes. The zebrafish reference genome 
(http://www.ensembl.org/Danio_rerio/Info/Index) and
gene and transcript annotations are readily available
from major genome browsers and databases such as the
National Center for Biotechnology Information
(http://www.ncbi.nlm.nih.gov/genome/guide/zebrafish), Ensembl
(23), University of California Santa Cruz (UCSC) genome 
browser (24) and Zebrafish Model Organism (ZFIN) data- 
bases (25). Many post-genomic data sets, including the 
complete genome of a wild zebrafish (26), Expressed 
Sequence Tags (EST) and cDNA collections; transcriptome 
and genome variations; as well as a host of related infor- 
mation including gene loci, primary transcript and alterna- 
tively processed transcripts and protein information are 
also available for this excellent model system. Specialized 
resources such as the Zebrafish Mutant Collection (27), the 
Zebrafish Mutation Project (http://www.sanger.ac.uk/Projects/D_rerio/zmp/),
Zebrafish Tilling Project (https://webapps.fhcrc.org/science/tilling/),
zinc finger nuclease-
targeted mutations (28), zTrap (29), ZETRAP 2.0 (30) and 
Zfishbook (31) (http://zfishbook.org/) provide periodic up- 
dates regarding the transgenic and mutant lines that 
have been generated. Previously, we created the FishMap 
(32, 33) database to integrate the genome-scale informa- 
tion on zebrafish into a centralized data repository with a 
visual interface. This platform combines computational predictions
with experimental data sets and is equipped with
interfaces for visualization and integrative analysis of zebrafish
genome-scale data. It is widely used by the zebrafish
scientific community.

We have applied the structured wiki concept to create a 
semantically organized system for community curation (4) 
of gene function in zebrafish. We relied on community participation
to manually and systematically curate information
from published literature on genes and gene
functions into a structured annotation portal. The commu- 
nity members were given due credit for their contributions 
using a microattribution system, which was later used to
assign authorship of this manuscript (34, 35). The system is
akin to a wiki in many ways, albeit with a structured 
format, which would allow for semantic integration of con- 
tent and serve as a structured interface for aggregating 
and storing information. The resource holds synchronized 
and updated annotations for a large number of zebrafish 
genes. This information is continuously enriched by collect- 
ive community inputs and is a model for community in- 
volvement for biocuration in genomics for model 
organisms. 

Materials and Methods 

Annotation protocol 

We used a community curation approach. Each volunteer 
went through an initial training process under the guid- 
ance of a curator, during which the volunteer was familiar- 
ized with the concepts of annotation, standards and 
databases and how to systematically assemble annotation 
information for a gene. We followed a standard gene an- 
notation protocol, which included literature survey and 
checking for information in biological databases. After 
the volunteers were introduced to the process and standard 
protocols, they were provided with a template GeneCard
and asked to annotate the genes using information as
described in the published articles and corresponding data- 
bases (Supplementary File S2). The annotation process 
began with selecting an available gene of interest and sys- 
tematically collecting information for a particular gene, 
including the gene name, gene identifier (ID), RefSeq ID 
and transcript IDs, in a preapproved format provided in 
the Zebrafish GenomeWiki. The key reference anchor 
points in the Zebrafish GenomeWiki for any gene are the
gene name, gene ID and RefSeq ID. Therefore, we ensured 
that key reference anchor IDs correlated between the ZFIN, 
Ensembl and Zebrafish GenomeWiki databases. The anno- 
tation process also included extensive literature survey for 
both biological functions and mutant phenotypes. The vol- 
unteers were encouraged to discuss and share information 
through online media. Curators further manually
cross-checked all entries for inadvertent annotation errors 
(Figure 1). A self-explanatory tutorial, also linked from the
Zebrafish GenomeWiki web page, serves as a quick reference
guide for new users (Supplementary File S1).
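
The cross-check of anchor IDs between databases can be illustrated with a minimal sketch. It assumes that each source has been exported as a mapping from gene name to its identifiers. The record layout, the field names and the function name find_anchor_mismatches are hypothetical and are not part of the Zebrafish GenomeWiki itself.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: gene name -> {"gene_id": ..., "refseq_id": ...}
Anchors = Dict[str, Dict[str, str]]

# Anchor fields compared between the wiki and a reference database (assumed names).
ANCHOR_FIELDS = ("gene_id", "refseq_id")


def find_anchor_mismatches(wiki: Anchors, reference: Anchors) -> List[Tuple[str, str, str, str]]:
    """Return (gene, field, wiki_value, reference_value) for every disagreement."""
    mismatches = []
    for gene, wiki_record in sorted(wiki.items()):
        ref_record = reference.get(gene)
        if ref_record is None:
            # The gene name itself has no counterpart in the reference source.
            mismatches.append((gene, "gene_name", gene, "<missing>"))
            continue
        for field in ANCHOR_FIELDS:
            wiki_value = wiki_record.get(field, "")
            ref_value = ref_record.get(field, "")
            if wiki_value != ref_value:
                mismatches.append((gene, field, wiki_value, ref_value))
    return mismatches
```

A curator could run such a check against each reference source in turn and review only the genes that are reported.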

Zebrafish gene nomenclature 

We used the current zebrafish reference genome from 
UCSC/Ensembl (Zv9 build) for our curation. For gene no- 
menclature, we used the official nomenclature from ZFIN. 
For each annotated gene, the ZFIN-approved gene name 
anchors its corresponding GeneCard (36) in the Zebrafish 
GenomeWiki. We have provided links and alternate nomenclatures
available at other databases for ready comparison. Wherever alternate
or redundant gene names have been used, we have also retained them
for reference.

Figure 1. A schematic representation of the annotation protocol. The annotators were given a set of genes and a ready reference. The annotated entries were submitted through online media and were quality checked before upload.

Organization of information webs and subwebs 

Zebrafish GenomeWiki is a structured wiki, hierarchically 
organized into webs and subwebs. The wiki follows a hier- 
archy wherein the main web pages are the gene pages 
providing gene level information and links to other webs 
and external resources. The wiki also incorporates FishNet, 
a search engine for zebrafish gene annotation and 
resources. FishNet uses Swish-e (Simple Web Indexing 
System for Humans — Enhanced) (http://swish-e.org/) for 
efficient and speedy retrieval of the data. It indexes the
entire data set and retrieves the entries that match the
keyword provided.
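
The index-then-lookup idea behind keyword search can be sketched as a small inverted index. This is not how Swish-e is implemented, and the function names and sample pages are hypothetical. It only illustrates why a prebuilt index allows quick retrieval by keyword.

```python
import re
from collections import defaultdict
from typing import Dict, Set


def build_index(pages: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each lowercased word to the set of page names that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for name, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return index


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return the pages that contain every keyword in the query (logical AND)."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    # Copy the first hit set so that the index itself is never modified.
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return hits
```

The index is built once over all pages, so each query reduces to a few set lookups rather than a scan of every page.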

Data formatting and exchange 

Information from the various sources is converted into a 
uniform standard format as specified in the Zebrafish 
GenomeWiki template. The primary entry of each gene is 
a template, called the GeneCard (36). Each entry on the 
GeneCard or any of the annotation templates in the 
subwebs has a defined standard input format, which
enables interlinking between webs, subwebs and external
databases. The complete list of ontologies used in the
webs is summarized in Supplementary Table S1. In 
Zebrafish GenomeWiki, a registered user can edit as well 
as extract information. Any changes made to a particular 
web page can be tracked in web history, allowing a check 
against any kind of mismatch or obsolete information 
upload. 
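
A defined input format for each field can be enforced with a simple validation step before an entry is accepted. The sketch below is illustrative only. The field names and the choice of Ensembl-style and RefSeq-style patterns are assumptions, and the formats actually used by the wiki are those specified in its templates.

```python
import re
from typing import Dict, List

# Illustrative patterns only (assumed); the wiki templates define the real formats.
FIELD_PATTERNS = {
    "GeneID": re.compile(r"ENSDARG\d{11}"),            # Ensembl-style zebrafish gene ID
    "TranscriptID": re.compile(r"ENSDART\d{11}"),      # Ensembl-style zebrafish transcript ID
    "RefSeqID": re.compile(r"[NX][MR]_\d+(\.\d+)?"),   # RefSeq-style transcript accession
}


def validate_gene_card(card: Dict[str, str]) -> List[str]:
    """Return a list of readable problems; an empty list means the card passes."""
    problems = []
    for field, pattern in FIELD_PATTERNS.items():
        value = card.get(field, "").strip()
        if not value:
            problems.append(f"{field}: missing")
        elif not pattern.fullmatch(value):
            problems.append(f"{field}: unexpected format {value!r}")
    return problems
```

Rejecting malformed identifiers at entry time keeps links between webs, subwebs and external databases resolvable.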

Results and Discussion 

Zebrafish GenomeWiki components 

The Zebrafish GenomeWiki provides a seamless search 
interface through FishNet, a multi-resource search engine 
for zebrafish gene annotation. The FishNet query page in 
turn provides links to the corresponding GeneCard. The 
main GeneCard page is categorized into a number of 
subwebs. The subwebs attempt to cover the majority of 
important data related to zebrafish genomics, including 
but not limited to human orthologs present in zebrafish, 
mutants, morpholinos, disease models, transposon insertion
sites and information on noncoding RNAs (Figure 2).
Each gene ID is hyperlinked to the main GeneCard entry. 
A GeneCard provides a detailed description of the gene. It
consists of a number of entry points such as the Gene ID, 
GeneName, Transcript ID, Protein ID, Refseq ID, known 
GeneFunction, GeneOntology, GeneExpression and refer- 
ences. The reference section provides direct links to the 
related publications from where the information has 
been extracted. The highlighted entry under each category
links to a new web page containing related information
and has semantic linkouts wherever necessary. Each 
GeneCard page also provides links to external web pages. 
The Zebrafish GenomeWiki is also provided with a sandbox 
feature that guides the user on how to create new topics. 

Annotating genes in the Zebrafish GenomeWiki 

The present version of the Zebrafish GenomeWiki involved ~40
students and curators, including 30 annotators, from eight different
laboratories and universities. They carried out the initial curation
of ~600 genes, followed by quality checks and upgrades of
the database (Supplementary Figure S1). The success of
community annotation (4, 37) lies primarily in the availability
of a standard operating protocol or format for data organization
and easy retrieval, centralized annotation submission
and an expert quality check to avoid any ambiguity
or duplication. Furthermore, the data points should
link back to each other to establish a proper workflow. The 
Zebrafish GenomeWiki follows a uniform template on 
which the information is collected and curated. Existing 
genes already have a ready template page in the 
Zebrafish GenomeWiki, and new gene templates can be 
generated as and when required. Key reference anchor 
points such as the 'GeneName', 'Gene ID' and 'RefSeq ID' 
provide an entry point for the annotation in the Zebrafish 
GenomeWiki. Biological information collected through ex- 
tensive manual review of literature and unpublished infor- 
mation from individual laboratory web pages or databases 
are centralized and then undergo substantial quality checks 
for accuracy and precision. Finally, the curated data are 
collated into individual GeneCards in the Zebrafish 
GenomeWiki. 

Correcting annotation mistakes and database 
maintenance in the Zebrafish GenomeWiki 

The Zebrafish GenomeWiki provides a dynamic real-time 
interface for uploading and extracting information. It 
saves information regarding all updates and modifications 
performed by any user in real time. Therefore, if a user 
inadvertently introduces a mistake during the annotation 
process, the error can be easily corrected by reverting
to the last correct version (see Supplementary File S1 for 
details). The database also permits users to download a 
database snapshot for archiving on their private servers. 
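
The revert mechanism can be illustrated with a minimal sketch of a page that stores every saved revision. The class and method names are hypothetical and do not describe the wiki software's internals. The sketch only shows why a full revision history makes mistakes easy to undo.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Revision:
    user_id: str
    text: str


@dataclass
class WikiPage:
    """A page that keeps every saved revision, oldest first."""
    history: List[Revision] = field(default_factory=list)

    def save(self, user_id: str, text: str) -> None:
        self.history.append(Revision(user_id, text))

    @property
    def current(self) -> str:
        return self.history[-1].text if self.history else ""

    def revert_to(self, index: int, user_id: str) -> None:
        """Restore an earlier revision by saving it again as the newest one.

        Nothing is deleted, so the mistaken edit stays visible in the history.
        """
        if not 0 <= index < len(self.history):
            raise IndexError(f"no revision {index}")
        self.save(user_id, self.history[index].text)
```

Because a revert is recorded as a new revision, the history remains a complete audit trail of who changed what.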

Additionally, the complete database is archived quarterly
for security purposes and is also updated quarterly. The data
are derived from Ensembl and ZFIN and are thereafter updated
automatically.

Figure 2. Screenshot of a sample GeneCard entry. In the Zebrafish GenomeWiki, the top panel of the page consists of a genome browser interface. The GeneCard is divided into a number of fields each of which consists of information for a particular gene. All the entries highlighted in blue in the information panel are the linkouts to the respective source databases. The bottom panel of the page provides the information about the revision history of the particular gene including an option to edit the page. The information about the last annotation and the annotator is available at the bottom left corner of every page.

Microattribution: ensuring credits for the
contributors

Success of any wiki-based resource depends on active community
participation. Because studies show that community
members are most willing to contribute voluntarily when
their contributions are readily recognized (34, 37), the 
Zebrafish GenomeWiki follows a microattribution system 
of credit sharing. Both correct and incorrect annotation 
contributions are tagged with the respective user's unique 
ID. This ensures that credits for all the contributions go back 
to the user in real time. The microattribution information 
for the individual contributions is provided as a subweb for 
ready reference. The microattribution system linked to con- 
tributions is also used to ensure proper credit sharing. For 
example, users who have contributed to the contents of the 
current version of Zebrafish GenomeWiki database have 
been listed as coauthors in this manuscript based on their 
microattribution credits. 
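
Since every contribution is tagged with a user ID, credits can in principle be tallied directly from the edit log. The following minimal sketch assumes a log of (user ID, gene name) pairs. The log format and the function names are hypothetical, and the weighting actually used for authorship is not described here.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def tally_credits(edit_log: Iterable[Tuple[str, str]]) -> Counter:
    """Count edits per user from (user_id, gene_name) log entries."""
    return Counter(user_id for user_id, _gene in edit_log)


def rank_contributors(edit_log: Iterable[Tuple[str, str]]) -> List[Tuple[str, int]]:
    """Contributors sorted by descending credit, ties broken by user ID."""
    credits = tally_credits(edit_log)
    return sorted(credits.items(), key=lambda item: (-item[1], item[0]))
```

This sketch counts each edit equally. A real scheme might weight edits differently, for example by whether they pass the curators' quality check.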

Current status and the way forward for Zebrafish 
GenomeWiki 

In the first version of the Zebrafish GenomeWiki, we have 
manually curated 600 genes for which the users have pro- 
vided biological annotations. In addition, the Zebrafish 
GenomeWiki also contains ~52 896 transcripts and 4150 
proteins. The Zebrafish GenomeWiki described here is
a starting point toward systematically collating annotations 
for genes and gene products within a structured wiki plat- 
form. The system solves to a large extent the issues with 
unstructured data on wiki-based annotation platforms and 
provides for an alternative strategy combining the advan- 
tages of a structured database-driven annotation system 
with the openness of a wiki-based annotation system. The
next step would be to integrate this resource with standard ontologies
and make it available as Resource Description Framework
(RDF) triples, thus making it compatible with semantic
web technology (21). Recent technologies have made it possible
to convert database formats to RDF triples, and many visualization
and search strategies using RDF data are
just emerging. We hope this would significantly enrich
ongoing projects, such as the Linked Data Initiative 
(http://linkeddata.org/). 
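
To make the RDF export concrete, the sketch below serializes the fields of one structured entry as N-Triples, in which every statement is a subject-predicate-object line. The namespace, the vocabulary terms and the function names are hypothetical. A real export would map each field to terms from the standard ontologies mentioned above.

```python
from typing import Dict, List

# Hypothetical namespace, for illustration only.
BASE = "http://example.org/zebrafish/"


def escape_literal(value: str) -> str:
    """Escape backslashes, quotes and line breaks for an N-Triples string literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def gene_card_to_ntriples(gene_id: str, fields: Dict[str, str]) -> List[str]:
    """Emit one subject-predicate-object line per GeneCard field.

    gene_id and the field names are assumed to contain only IRI-safe characters.
    """
    subject = f"<{BASE}gene/{gene_id}>"
    lines = []
    for name, value in fields.items():
        predicate = f"<{BASE}vocab#{name}>"
        lines.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return lines
```

Because the entries are already structured, each field maps to exactly one statement and no text mining is needed to produce the export.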

We understand that to encourage and sustain high- 
quality annotation activity, apart from the peer review 
system and organizational hierarchy, there should be 
enough incentives to sustain the endeavor. All open 
source and open data initiatives have inbuilt incentive 
mechanisms, which sustain the organization and the com- 
munity. We would be keen to work with journals and data- 
bases, and with upcoming initiatives like the Bioresource 
Research Impact Factor (38), to make the resource, annotations
and annotator information interoperable. In future,
the training manuals including annotation and curation 
protocols will be made available on the database to im- 
prove the quality of the data sets uploaded and minimize 
spurious information. We encourage all zebrafish research 
community members to actively participate in this
collaborative environment for connecting the long tail for
zebrafish genome annotation. 

Supplementary Data 

Supplementary data are available at Database Online. 